Abstract

Providing captions for videos used in online courses is an area of interest for institutions of 
higher education. There are legal and ethical ramifications as well as time constraints to 
consider. Captioning tools are available, but some universities rely on the auto-generated 
YouTube captions. This study looked at a particular type of video—the weekly informal 
news update created by individual professors for their online classes—to see if automatic 
captions (also known as subtitles) are sufficiently accurate to meet the needs of deaf 
students. A total of 68 minutes of video captions were analysed and 525 phrase-level errors 
were found. On average, therefore, there were 7.7 phrase errors per minute. Findings 
indicate that auto-generated captions are too inaccurate to be used exclusively. Additional 
studies are needed to determine whether they can provide a starting point for a process of 
captioning that reduces the preparation time. 
Literature review

Captions, once a rarity, are now prevalent in today’s media, especially in television 
programming. They are sometimes referred to as subtitles and are generally considered to be a 
textual representation of a video’s audio message (Caption it Yourself, n.d.). This definition is not 
entirely accurate because captions can also translate visual languages such as American Sign 
Language (ASL) to a written language such as English (Matthews, Young, Parker, & Napier, 
2010). In these cases there might be no audio track. However, for the purposes of this discussion, 
the focus is on captions that represent spoken languages. 

In addition to televised shows, captioned films are available in some movie theatres and many 
captioned videos are available on the web. There are two styles of captioning—open captions 
and closed captions. Open captions are integrated with the video and cannot be turned off, 
whereas closed captions are read by the media player and can be turned on or off according to the 
user’s preference (Clossen, 2014). Captions are usually regarded as a tool to benefit people who 
are deaf or hard of hearing although many other groups of learners, including second-language 
learners, also use them (Collins, 2013). The focus of this discussion will be on web-based videos, 
especially those used in higher education settings. 

Using videos in higher education 

The number of colleges and universities offering online classes continues to increase; 32% of 
students in the United States are reported to be enrolled in at least one (Sheehy, 2013). Many 
universities have fully online degree programmes and the trend towards fully online courses is 
likely to continue for the foreseeable future. In addition, there has been an increase in the number 
of massive open online courses (MOOCs) offered by universities. MOOCs are free online 
courses that are offered without admission criteria (Anastasopoulos & Baer, 2013). Traditional 
classes and MOOCs often use videos to encourage students to establish connections and share 
content. The videos in these courses vary greatly—from professionally created and edited 
lectures, to informal announcement-style ‘talking heads’, to screencast tutorials. Many courses 
also embed videos that are available in popular media—for example, from TED Talks, Khan 
Academy, and others (Fichten, Asuncion, & Scapin, 2014). 

Online courses are not the only ones affected by the issue of captioning videos. For example, 
some face-to-face classes use a technique called lecture capture to record live lectures and make 
them available electronically for students to use as a review or, in some cases, as a substitute for 
class attendance (Newton, Tucker, Dawson, & Currie, 2014). These multimedia presentations 
often include the presenter’s audio and accompanying presentation slides. Many current 
approaches to lecture captioning do not provide captions for the video components (Newton et 
al., 2014). 

Rationale for captioning videos in higher education 

Many faculty members want to create courses that are accessible to all their students, but the 
issue of captioning videos goes beyond individual concern and has become a legal matter for 
universities. Two laws in the United States provide a guideline for educational institutions. 
Section 504 of the Rehabilitation Act 1973 prohibits a college from denying disabled individuals 
any benefits. Title II of the Americans with Disabilities Act 1990 says that individuals with 
disabilities may not be excluded or denied the benefits of the service of public universities and 
colleges (Anastasopoulos & Baer, 2013). Although both of these laws were written before the 
explosion of web-based multimedia, they are the basis for the US Department of Education’s 
Office for Civil Rights’ policy “... that an institution’s communications with persons with 
disabilities must be as effective as the institution’s communication with others” (Anastasopoulos 
& Baer, 2013, p. 2). The practical interpretation of “undue 
burden” is also uncertain. The term is defined in the Americans with Disabilities Act as “significant difficulty or 
expense” (Americans with Disabilities Act 1990, Title III), but the definition remains ambiguous in certain 
circumstances. In addition, the World Wide Web Consortium has established web content 
accessibility guidelines, and one of their recommendations is that all pre-recorded audio be 
captioned (Anastasopoulos & Baer, 2013). 

There are legal ramifications for universities that do not provide captioning for videos. A recent 
lawsuit by the National Association of the Deaf (NAD) accused Harvard and MIT of offering 
MOOCs and many individual videos to the public without captioning (“Lawsuits ask Harvard”, 
2015). The NAD reported that video captions for MOOCs run by Harvard and MIT were often 
missing—or present but inaccurate to the point of being unintelligible—so violating the 
Americans with Disabilities Act and the Rehabilitation Act. 

One way to relay the audio information in a video is to provide a script, usually in the form of a 
Word document or a PDF file. However, that solution does not take into account the benefits of 
synchronising the video and verbal content (Clossen, 2014; Parton & Hancock, 2009). Therefore, 
“if transcripts are the letter of the law ... then captions are the spirit [of the law]” (Clossen, 2014, 
p. 32). 

Creating captions 

As videos become commonplace in online courses offered by institutes of higher education, and 
there is a legal and ethical need to caption those videos for the benefit of all students (but 
especially those who are deaf and hard of hearing), the issue becomes one of how to provide the 
captions. The full range of options for creating captions is outside the scope of this article, but a 
brief overview of popular techniques will serve the discussion. 

First, universities can choose to outsource the role of captioning to a professional company. This 
option can be expensive and requires lead time. It might work for key lectures, but might not be 
feasible for more frequent and informal communication (Johnson, 2014). Pay-per-use services 
such as SynWords (Dubinsky, 2014) have similar limitations. 

Second, a professor can choose to manually caption videos using one of several tools that are 
available for free or purchase. The process typically involves synching a script (pre-existing or 
created on the fly) with time points in the video. Media Access Generator (MAGpie) was the 
original free caption-authoring tool—although robust, it does require a relatively steep learning 
curve (Parton, 2004). Subtitle Workshop is another popular free tool that can be downloaded and 
used to create a caption file (Caption it Yourself, n.d.). Amara, which is browser-based, is easy to 
use. The user interface design is typical of caption-authoring tools and provides a space for the 
captions to be entered and a timeline to sync the captions with the audio. Third-party tools 
usually create a separate file for the captions, although Amara instead publishes the new 
captioned video on their own server. The need for these and other tools has diminished since 
YouTube integrated its own captioning tool. Users can now add a language track to their 
YouTube videos (Carlisle, 2010). 

The ongoing development of the YouTube auto-captioning tool has created a third path for 
professors, and it is the focus of the current study. Instead of manually captioning videos, they can 
now take advantage of YouTube’s auto-captioning feature, in which speech-to-text software 
generates the captions without human intervention (Fichten et al., 2014). This method of 
captioning is the quickest option available to professors, but the issue of accuracy has been 
debated. In a recent panel discussion during the IT Accessibility in Higher Education 
Conference, students noted that professors were relying on YouTube’s auto-generated captions 
and expressed concern about accuracy (Bennett, Wheeler, Wesstrick, Teasley, & Ive, 2015). Still, 
a national study of deaf students (N = 95) found that 85 of the participants preferred to watch 
videos with captions generated from automatic speech recognition rather than to have no captions 
(Shiver & Wolfe, 2015). This scenario leads to a situation where auto-generated captions may be 
seen as an acceptable alternative by deaf users, but “deaf advocacy groups could be concerned 
that organizations may attempt to substitute automatic captions [for professionally created ones] 
in order to meet legal obligations” (Shiver & Wolfe, 2015, p. 237). 

Auto-generated YouTube captioning feedback 

The US Department of Education’s Office for Civil Rights lists “accuracy of the translation” as 
one of their criteria for determining whether a university’s communication is effective for people 
who need captions, such as those who are deaf or hard of hearing (Anastasopoulos & Baer, 
2013). There is limited and varying research in the literature on the accuracy of YouTube’s 
auto-generated captioning in educational settings. Johnson (2014) reports “[t]he automatic captions 
are notoriously inaccurate, leading to the creation of an Internet meme known as ‘YouTube 
Automatic Caption FAIL,’ wherein users post humorous examples of YouTube captions that 
don’t match the actual audio content” (p. 11). In response to an article on MOOCs 
(Anastasopoulos & Baer, 2013), a deaf reader commented that she was devastated to see how 
many videos in these courses were relying on YouTube’s auto-captioning because they were full 
of errors and did not have proper timing. Other researchers have been less critical. While still 
acknowledging the limitations of the tool, they report it is an easy solution that reduces the time- 
consuming work of manually creating the captions, but that the results are sometimes 
unintentionally humorous (Clossen, 2014). Suffridge & Somjit (2012) have found “[w]hile 
YouTube captioning is not 100% accurate, it does do a fairly good job” (p. 3). A frequent 
recommendation is to start with the auto-captions and then edit to reduce the number of errors 
and fix any timing issues (Clossen, 2014; Johnson, 2014). 

The study 

The guiding question for this research study was: Do auto-generated YouTube captions meet the 
needs of students? In the spirit of Universal Design for Learning (UDL), captions can be 
beneficial to a wide range of students (Clossen, 2014); however, this study focuses primarily on 
this question through the lens of a deaf student. The term ‘deaf’ here refers to individuals who 
are culturally Deaf (i.e., individuals who self-identify as part of a cultural and linguistic 
minority) and/or those who are physically deaf or hard of hearing. The term ‘meet the needs’ is 
rather ambiguous and is a topic for further discussion but in practical terms, and in this context, it 
refers to the accuracy of the captions. 

Although online college courses use many types of videos, the ones chosen for this study were 
casual weekly videos made by one professor. Weekly videos make students feel more connected 
to the instructor and are easy to produce with a cell phone or web cam (Suffridge & Somjit, 
2012). While the videos might not contain critical course content, they do provide an important 
social presence between professors and students. It would not be feasible, in terms of time or 
cost, to professionally caption these videos, yet they often contribute significantly to the positive 
atmosphere of an online course and must therefore be accessible. 

Methods 

The first step in this study was to obtain a series of bi-weekly videos that were professor-made 
and used in online courses. Because they met the study’s criteria, the author’s own materials 
were selected. All of the announcement video links were supplied for three graduate-level 
courses in the 15-week spring semester of 2015. The total number of videos created for the 
semester was 21 (seven per class). There was a total of 68 minutes of video in the 21 segments. 
All of the videos were made on a laptop with a built-in webcam and built-in microphone. No 
editing was performed on the videos other than adding basic border frames. The videos had been 
uploaded to YouTube and then embedded in Blackboard, the course management system. No 
attempt was made to use or check the auto-generated captions because no deaf students were 
enrolled and no other students expressed a need for captions. 

The next step was to analyse each video and its auto-generated captions for errors. The literature 
did not reveal a standard approach (in legal or practical terms) for determining the criteria for 
considering the captions’ accuracy. Although errors could be minor misspellings or text such as 
filler words (e.g., “um”) that did not exactly match the speaker, those issues do not commonly 
affect comprehension in isolation. Therefore, the decision was made to look at phrase-level 
errors—those that altered the meaning of the message or made the message unintelligible. Thus, 
as each video was played, a record was made of each phrasing error, but grammatical errors, 
misspellings, and minor word changes were omitted. Although there was some subjectivity in 
this approach, it provided a holistic view of the state of the videos’ captions and allowed 
researchers to focus on the critical components. 
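
The counting rule described above can be sketched in a few lines of Python. This is an illustration only, not the procedure used in the study, which was carried out by hand while each video played. The `CaptionError` record, its field names, and the assignment of sample phrases to particular video IDs are all assumptions introduced for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionError:
    """One logged mismatch between the audio and the caption (hypothetical record)."""
    video_id: int          # illustrative ID, not taken from the study data
    audio_phrase: str
    captioned_phrase: str
    kind: str              # "phrase" = meaning altered or unintelligible; "minor" = ignored


# A tiny illustrative log. The phrases come from the article; the video IDs are made up.
log = [
    CaptionError(1, "Spring semester", "Springs the minister", "phrase"),
    CaptionError(1, "um", "am", "minor"),
    CaptionError(2, "The end", "Indian", "phrase"),
]

# Only phrase-level errors are counted; minor word changes are omitted.
phrase_errors = Counter(e.video_id for e in log if e.kind == "phrase")
total_phrase_errors = sum(phrase_errors.values())

print(dict(phrase_errors), total_phrase_errors)
```

Keeping the minor entries in the log while excluding them from the count makes the subjective judgement (phrase-level or not) explicit and open to later review.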

Results 

The number of phrase errors in the video captions was substantial. A total of 525 such errors 
were recorded during the 68 minutes. This means that, on average, there were 7.7 phrases per minute 
that were unintelligible or altered the meaning of the message. Table 1 shows 50 of those errors. 
The complete list of 525 errors can be viewed as raw data in the project data spreadsheet (see footnote 1). 

Table 1 Sample caption errors 

| Audio phrase | Captioned phrase |
| --- | --- |
| Spring semester | Springs the minister |
| I’ve taught this one | Topless one atom |
| One wonderful kitty | Wonderful kidding |
| To reply to a classmate | To rip apply to a classic am |
| A movie chat session | Move in chess session |
| Classroom for the deaf | Classroom for the debt |
| New edition | New dish |
| Not getting paid from Amazon | Have had making paper metal |
| Kinda dating myself there | And can and a mess up there |
| I had to code | Ahead to coat |
| Where I learned Adobe Premiere | World anti-doping from here |
| Do something fun | Get to the summit bar |
| Did for Deaf president now | That they didn’t protect president now |
| And keep building on that | Can’t keep Billy |
| Gave you link for | Didn’t foreign still unit |
| Look at 10 to 15 of them | Like unit in anti-government |
| And good stuff on blackboard | It’s tough black |
| Dr. Martin Luther King Jr. | With the key engineer |
| Now we think laptops and cell phones | Our with the game at pops open |
| And all that kind of good stuff | And Iraq and stuff |
| Can always email me | Can only mailman |
| And I think it relates | And at the gate |
| I wanted to | I lied to |
| But I encourage you | Bankers you |
| Google chat | People jet |
| Narrow anything down | Row anything to him |
| Here are my thoughts | Have here’s carol come apart from |
| Cost associated with | Some pasta associated with |
| Used my student loans | Miss you London |
| To all of you | I love you |
| Alright, enjoy | RN injury |
| Happy Valentine’s Day | Heavy bond on |
| And I | [swear word inserted instead] |
| I choose | iTunes |
| Lesson plan | Let them play 'em |
| Hi folks | Hyper X |
| Active presenter | Active prisoner |
| Some of our own students | The Maroons didn’t then doing |
| Over spring break | Overseas |
| Hi guys | White guys |
| The big conference, ISTE | The camera is Steve |
| 1 more module | 1 more macho |
| 5 of them will be for | But I don’t know be four |
| Critiques, you are going to put straight on | Crunchy chicken constraint |
| Fun chapters | Punch actors |
| Towards the end | Toys into the |
| For some doc interviews | Person docking abuse |
| Taught a whole course | Have tomahawk or |
| You can | UK inch |
| The end | Indian |

1 See Captioningerrors 
https://docs.google.com/spreadsheets/d/1 IZVi74wUJH4HK9oL_2GQ[fizAcYJFZOI9drEZDRbDB4/edit?pref=2&pli=1#gid=Q 

Errors occurred in all of the videos. No single video stood out as notably worse than the rest, although the 
rate ranged from 2.5 to 13.3 errors per minute. Table 2 breaks down the data per video. 

Table 2 Video caption errors per minute 

| Video ID | Length of video | # of phrase errors | Errors/minute (rounded) |
| --- | --- | --- | --- |
| 1 | 7:25 | 55 | 7.3 |
| 2 | 9:41 | 73 | 7.5 |
| 3 | 4:52 | 52 | 10.4 |
| 4 | 3:03 | 40 | 13.3 |
| 5 | 7:01 | 60 | 8.6 |
| 6 | 1:22 | 11 | 7.3 |
| 7 | 1:49 | 16 | 9.1 |
| 8 | 4:11 | 25 | 5.9 |
| 9 | 2:24 | 21 | 8.4 |
| 10 | 2:01 | 7 | 3.5 |
| 11 | 2:00 | 5 | 2.5 |
| 12 | 3:08 | 15 | 5.0 |
| 13 | 1:56 | 10 | 5.0 |
| 14 | 3:02 | 15 | 5.0 |
| 15 | 2:05 | 16 | 8.0 |
| 16 | 1:58 | 7 | 3.5 |
| 17 | 1:31 | 20 | 13.3 |
| 18 | 2:54 | 34 | 11.3 |
| 19 | 1:47 | 14 | 8.0 |
| 20 | 2:46 | 18 | 6.5 |
| 21 | 0:58 | 11 | 11.0 |
| Total | 68 minutes | 525 errors | 7.7 avg errors/min |
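
The overall rate can be rechecked from the per-video figures in Table 2. The following Python sketch is illustrative only and is not part of the study's method: it converts each video length to minutes, sums the lengths and the error counts, and divides one by the other. The names `VIDEOS` and `to_minutes` are introduced here for illustration.

```python
# Table 2 data: (video ID, length as "m:ss", number of phrase errors).
VIDEOS = [
    (1, "7:25", 55), (2, "9:41", 73), (3, "4:52", 52), (4, "3:03", 40),
    (5, "7:01", 60), (6, "1:22", 11), (7, "1:49", 16), (8, "4:11", 25),
    (9, "2:24", 21), (10, "2:01", 7), (11, "2:00", 5), (12, "3:08", 15),
    (13, "1:56", 10), (14, "3:02", 15), (15, "2:05", 16), (16, "1:58", 7),
    (17, "1:31", 20), (18, "2:54", 34), (19, "1:47", 14), (20, "2:46", 18),
    (21, "0:58", 11),
]


def to_minutes(length: str) -> float:
    """Convert an "m:ss" string to fractional minutes."""
    minutes, seconds = length.split(":")
    return int(minutes) + int(seconds) / 60


total_errors = sum(errors for _, _, errors in VIDEOS)
total_minutes = sum(to_minutes(length) for _, length, _ in VIDEOS)
overall_rate = total_errors / total_minutes

print(f"{total_errors} errors over {total_minutes:.1f} minutes")
print(f"{overall_rate:.1f} phrase errors per minute")
```

Because this calculation uses the exact lengths, per-video rates computed the same way may differ slightly from the rounded column in Table 2.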


The types of errors found in this analysis reveal serious issues in the auto-captioning process. See 
Figure 1 for a sample of screenshots depicting the inaccurate subtitles. 



Figure 1 Sample of captioning error for the audio message: “You can always email me. I have the due 
dates.” 

EDTC628-MajorProject1_spring15 Retrieved from https://youtu.be/L-84wctzvRU 
© Becky Sue Parton 

In two of the 525 cases, the YouTube subtitle showed a swear word that was clearly not said by 
the professor. In other cases the captions were similar to the audio, but the meaning was altered 
significantly by a minor error. For example, the phrase ‘3 to 5 questions’ was shown as ‘35 
questions’. Some of the errors were to be expected due to the use of proper nouns for names and 
places, but these did not comprise a substantial proportion of the phrase-level errors. Many of the 
errors, as one might expect, occurred when a wrong word was substituted for the right one 
because they sounded alike—such as ‘the end’ becoming ‘Indian’. These associations make no 
sense to an individual who is deaf and does not read by ‘sounding the words out’. In addition to 
the phrase-level errors, the grammatical and syntactical errors were too numerous to consider for 
this study. The filler word “um” was often displayed as “am”, words were at times displayed 
as abbreviations (e.g., “r” for “are”), tense was often shown wrongly, homophones such as “two” and “to” 
were used interchangeably, and so on. 

Discussion and limitations 

Some limitations to this study might have affected the generalisability of the results. Only one 
professor’s videos were analysed; the study does not, therefore, take into account other speakers’ 
accents, which could influence the phrase error rate. The sound quality and the equipment used 
to record the videos were typical of the setup that a professor would use to record weekly news 
updates, but different software, microphones, and speaker positioning could lead to a different 
rate of accuracy for the auto-generated captions. The issue of sound quality could be the basis for 
a future study. In addition, this study focused solely on bi-weekly informal videos, but professors 
often create a wide range of materials for their classes, including narrated screencasts, mini 
lectures, and feedback clips. It would be interesting to see how other types of video compare in 
an analysis of captioning errors. A recommendation for a future study would be to involve deaf 
individuals in the evaluation process to provide feedback and insight. 

Professors are often under time constraints when developing and/or teaching a course (Freeman, 
2015) so it is imperative to find a balance between the need for reliable captions and the ability to 
provide those captions quickly. However, results from this study indicate that, in most situations, 
auto-generated captions might not ‘meet the needs’ of deaf students in terms of providing 
accurate subtitles. In practical terms, the 7.7 phrases per minute that were unintelligible or altered 
the meaning of the message meant that the essence of the message was not understandable. The 
errors were so frequent that they were more than distracting—they were a barrier to 
communication. Without editing, the auto-captions would not appear to meet the Office for Civil 
Rights criteria for communication that is as effective for people with disabilities as for those 
without. Universities are therefore unlikely to be meeting their legal obligation to provide 
accessible material. 

Although captions created entirely by a human may be ideal—especially when the content is 
highly technical—edited auto-generated captions could play a role in conversational-style, 
weekly news videos created by professors. More research needs to be conducted to see if the 
time requirement for editing the auto-generated captions is feasible compared with manual 
captioning. Although the concept of crowdsourcing captioning (whereby other students in the 
class modify auto-generated captions) is a very new idea, it could also play a role in future 
discussions on time management and legal ramifications (Deshpande, Tuna, Subhlok, & Barker, 
2014). 

It would also be interesting to study the degree to which speech-to-text engines have become 
more accurate over time (it is 5 years since YouTube introduced the auto-captioning feature). An 
investigation could look at videos that have produced inaccurate captioning results in the past 
and see if the same errors occur if they are re-captioned today. An additional study could focus 
on the accuracy of the translations that are used in subtitles for other languages. 

Broader implications 

This study looked at captioning and accessibility primarily in relation to the needs of deaf 
individuals. However, there are implications for a far wider range of students. The concept of 
Universal Design for Learning (UDL) is that course materials are set up ahead of time to 
incorporate learning paths for everyone, rather than accommodating a specific user later on 
(Poothullil, Sahasrabudhe, Chavan, & Toppo, 2013; Tobin, 2014). The three principles of UDL 
are: 1) to provide multiple means of representation; 2) to provide multiple means of action and 
expression; and 3) to provide multiple means of engagement (Three Principles of UDL, n.d.). 
These principles can also apply to the concept of captioning. According to Tobin (2014, p. 17), 
“[c]aptions can help the vast majority of students”. Tobin identifies some of these students as 
second-language learners, those who are studying in quiet places such as a library, and those who 
process content better via text. 

Accessibility affects students of every nationality. While many countries have legal requirements 
for captioning for both the general public and students, others do not. In India, for example, there 
is no mandatory captioning (Poothullil et al., 2013), although there is recognition of the benefits 
of captioning—including as a tool for reading practice to combat the high illiteracy rate 
(Poothullil et al., 2013). One can see that, if captions are to be used in this manner, they must be 
accurate. Time and cost, however, remain a concern for many. For example, in Japan’s corporate 
sector there is a desire to provide real-time captioning that is not as costly as a stenographer. One 
current research study seeks to combine automated speech recognition software with manual 
editing that can be accomplished by a non-expert rather than a trained stenographer (Takagi, Itoh, 
& Shinkawa, 2015). This scenario appears similar to that of the professor (a non-expert) 
combining their manual edits with the auto-captioning results. 

Another broad implication of this study relates to the idea of meeting the needs of students. Does 
the (in)accuracy of video captioning truly embrace that concept? Even accurate captions will not 
meet students’ needs if they cannot read and comprehend them. “Producers of captions and 
educators have both been concerned whether individuals who are deaf are able to understand 
captions that are presented at relatively fast speeds and that sometimes contain complex 
grammatical forms” (Stinson & Stevenson, 2013, p. 453). The limited reading proficiency of 
some people who are deaf, and for whom English is often a second language, has long been 
noted to correlate with their ability to comprehend captions (Cambra, Silvestre, & Leal, 2009; 
Stinson & Stevenson, 2013). Multiple studies have focused on modifying captions to address this 
issue; for example, by reducing language complexity in the captions, slowing the caption rate, 
and embedding expanded information in the captions. This extra information might be hyperlinks 
to define key words, or to provide illustrations (Stinson & Stevenson, 2013). Other researchers 
have argued that the way to ensure effective communication is to provide an interpreted video 
when the student’s primary language is ASL (Parton & Hancock, 2009). (An interpreted video is 
one in which a human signer or, in some cases, an animated avatar, translates the audio content 
into a particular sign language.) Although such efforts are outside the scope of the current study, 
it is worth considering whether captions, auto-generated or not, may, like script files, be 
serving the letter, but not the spirit, of the law. 

In practical terms, these extended measures are probably too complicated to perform on routine 
weekly video updates produced by professors who often have little or no technical background or 
experience in working with individuals who are deaf. Thus, given the time and technical 
constraints, YouTube’s auto-generated captioning may be a viable start to a solution for 
professors who want to create informal, accessible video updates. However, because it would not 
meet the legal requirements established by universities in many countries, nor fulfil the spirit of 
UDL, it is only a partial solution and should not be relied on exclusively. 